Association between two measurements. As scientists, we often try to decipher the relationship between phenomena. Is thing 1 associated with thing 2? And ultimately, does thing 1 influence thing 2? Does thing 1 cause thing 2?
Anderson: cataloged natural variation in Iris species.
Fisher: worked out how to classify the species using relationships among the measurements.
By Frank Mayfield - originally posted to Flickr as Iris virginica shrevei BLUE FLAG, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=9805580
Setosa attribution: CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=170298
# Fisher's Iris data
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
cor(iris$Petal.Length,iris$Petal.Width)
## [1] 0.9628654
# Correlation is high, but there is more than one group.
# Fisher was trying to build a classifier
#with(iris, plot(Petal.Length, Petal.Width, col=Species, main="iris dataset"))
library(ggplot2) # for ggplot()
gp = ggplot(iris,
aes(Petal.Length, Petal.Width, col=Species)) +
geom_point() +
ggtitle("iris dataset")
print(gp) # plots the previous commands
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
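Before fitting anything, it is worth checking the correlation within each species separately; the pooled r = 0.96 turns out to be much stronger than any within-group value (a preview of the grouping issues discussed later).

```r
# correlation of petal length vs. width within each species
r.within = sapply(split(iris, iris$Species),
                  function(d) cor(d$Petal.Length, d$Petal.Width))
round(r.within, 2)
```

Each within-species correlation is well below the pooled value, so most of the pooled correlation comes from the separation between species.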
iris.mod = lm(Petal.Width ~ Petal.Length, data=iris)
# add the regression line to the plot
gp + geom_abline(intercept = iris.mod$coefficients['(Intercept)'],
slope = iris.mod$coefficients['Petal.Length'])
summary(iris.mod)
##
## Call:
## lm(formula = Petal.Width ~ Petal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56515 -0.12358 -0.01898 0.13288 0.64272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.363076 0.039762 -9.131 4.7e-16 ***
## Petal.Length 0.415755 0.009582 43.387 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
## F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
The slope, 0.415755, is fit by least squares; the adjusted R-squared is 0.9266. The last line shows the F-statistic/ANOVA. You can perform that test directly on the model via anova(iris.mod), or aov(iris.mod) for a slightly different interface.
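As a minimal self-contained sketch, the ANOVA table can be produced directly from the fitted model:

```r
# refit the model, then run the F-test directly
iris.mod = lm(Petal.Width ~ Petal.Length, data = iris)
anova(iris.mod) # reproduces the F-statistic from summary(iris.mod)
```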
Build up linear regression:

* from a straight line
* adding noise
* adding outliers
* adding groups
How does linear regression perform on these types of data, and what do we do when we encounter them?
Let’s reverse engineer the process of regression a little, and then start shaking it up.
We’ll start by plotting points on a perfect line, then add variation to those points.
# equation for a line is: y = mx + b
m = .5 # choose a slope
b = 3
# plot points on a line
N = 20 # number of points
x = 1:N
y = m * x + b
plot(x, y, main="Points along a line")
# The fit returns the parameters m and b
perfect = lm(y ~ x)
perfect
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 3.0 0.5
# summary(perfect)
# In summary.lm(perfect) : essentially perfect fit: summary may be unreliable
# add the fit line to the plot
plot(x, y, main="Perfect Line with fit")
abline(perfect) # "perfect" is the variable returned by "lm(y ~ x)"
# linear regression equation is: y = B0 + B1x + e
# the values of y deviate from the line by the error term e
set.seed(0) # try a different number for different randoms
y = b + m * x + rnorm(N, mean=0, sd=1) # the error is normally distributed, mean = 0, standard deviation = 1
plot(x, y,
xlim=c(0, max(x)),
ylim=c(0, max(y)),
pch=20,
main="Plot with normally distributed error added")
abline(h=0) # horizontal axis
abline(v=0) # vertical axis
model = lm(y ~ x)
abline(model) # add the fitted line to the plot
segments(x, y, x, predict(model), col="red") # add residuals to plot
# Plot: A line with (more) variation
set.seed(0)
y = b + m * x + rnorm(N, mean=0, sd=3) # notice the increase in standard deviation
plot(x, y,
xlim=c(0, max(x)),
ylim=c(min(0,y), max(y)),
pch=20,
main="straight line, more error")
# axes
abline(h=0) # horiz
abline(v=0) # vert
# fit the line
model.wider = lm(y ~ x)
abline(model.wider)
segments(x, y, x, predict(model.wider), col="red")
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81859 -0.65877 -0.06829 0.70369 2.37527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.65252 0.45466 8.034 2.31e-07 ***
## x 0.43769 0.03795 11.532 9.55e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9788 on 18 degrees of freedom
## Multiple R-squared: 0.8808, Adjusted R-squared: 0.8742
## F-statistic: 133 on 1 and 18 DF, p-value: 9.55e-10
summary(model.wider)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4558 -1.9763 -0.2049 2.1111 7.1258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9576 1.3640 3.635 0.0019 **
## x 0.3131 0.1139 2.749 0.0132 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.936 on 18 degrees of freedom
## Multiple R-squared: 0.2958, Adjusted R-squared: 0.2566
## F-statistic: 7.559 on 1 and 18 DF, p-value: 0.01319
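Comparing the two summaries, the slope's standard error roughly triples along with the noise. A small sketch (not part of the original code) makes this exact: reusing one set of error draws at two scales, the residual standard error, and hence the slope's standard error, scales by exactly the same factor.

```r
set.seed(0)
N = 20
x = 1:N
e = rnorm(N)              # one set of error draws, reused at two scales
y1 = 3 + 0.5 * x + 1 * e  # sd = 1
y3 = 3 + 0.5 * x + 3 * e  # sd = 3, same draws scaled
se = function(m) summary(m)$coefficients["x", "Std. Error"]
se(lm(y3 ~ x)) / se(lm(y1 ~ x))  # exactly 3
```

This follows from SE(b1) = sigma / sqrt(sum((x - mean(x))^2)): the denominator depends only on x, so scaling the errors scales the SE.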
Let’s add an outlier.
### Use a tighter spread.
set.seed(0)
y = b + m * x + rnorm(N, mean=0, sd=2)
### model2: x2, y2. Add a single outlier.
x2 = c(x, 1.5)
y2 = c(y, 12)
model2 = lm(y2 ~ x2)
library(broom) # augment() computes per-observation regression diagnostics
mod.diag = augment(model2)
head(mod.diag)
## # A tibble: 6 x 8
## y2 x2 .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.03 1 5.77 0.259 0.159 2.50 0.00127 0.116
## 2 3.35 2 6.06 -2.71 0.135 2.40 0.113 -1.20
## 3 7.16 3 6.35 0.806 0.115 2.49 0.00807 0.352
## 4 7.54 4 6.65 0.899 0.0973 2.49 0.00815 0.389
## 5 6.33 5 6.94 -0.610 0.0823 2.49 0.00307 -0.262
## 6 2.92 6 7.23 -4.31 0.0700 2.27 0.127 -1.84
cooksd = mod.diag$.cooksd
sum(cooksd > 4/N) # this counts the number satisfying "cooksd > 4/N"
## [1] 1
# Plot: We see the outlier in the upper left shifts the line slightly.
plot(x2, y2,
xlim=c(0, max(x2)),
ylim=c(min(0,y2), max(y2)),
pch=ifelse(cooksd > 4/N, 4,20), # 'X' for the outliers (pch=4), pch=20 otherwise
main="Outlier added (x)",
sub="dashed: model w/o outlier"
)
abline(h=0); abline(v=0)
abline(model, lty=2)
abline(model2)
segments(x2, y2, x2, predict(model2), col="red")
### model3: x3,y3. Add another outlier on the lower right
x3 = c(x2, 20.5)
y3 = c(y2, 0)
model3 = lm(y3 ~ x3)
mod.diag = augment(model3) # 'augment' is from package 'broom'
cooksd = mod.diag$.cooksd
sum(cooksd > 4/N)
## [1] 2
cooksd[cooksd > 4/N]
## [1] 0.2516166 0.9620788
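The .cooksd column from broom is the same statistic as base R's cooks.distance(); here is a self-contained cross-check of model3, rebuilt with the same seed and outliers:

```r
set.seed(0)
N = 20
x = 1:N
y = 3 + 0.5 * x + rnorm(N, mean = 0, sd = 2)
x3 = c(x, 1.5, 20.5)  # the two outliers added above
y3 = c(y, 12, 0)
model3 = lm(y3 ~ x3)
cd = cooks.distance(model3) # base R equivalent of augment()$.cooksd
sum(cd > 4/N)               # the same two points are flagged
```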
# Plot: We see that this new outlier pulls the slope drastically downward.
plot(x3, y3,
xlim=c(0, max(x3)),
ylim=c(min(0,y3), max(y3)),
pch=ifelse(cooksd > 4/N, 4,20), # 'X' for the outliers (pch=4), pch=20 otherwise
main="Second outlier added",
sub="dashed: model w/o outliers")
abline(h=0) # x-axis
abline(v=0) # y-axis
abline(model, lty=2) # original fitted line (dashed, lty=2), before outliers
abline(model3) # line with outliers added, (solid)
segments(x3, y3, x3, predict(model3), col="red") # residuals
set.seed(0)
# model4: x4, y4. Random cloud + extreme outlier.
# a random cloud centered at 10,5
x4 = rnorm(N, 10)
y4 = rnorm(N, 5)
# single outlier in lower right
x4 = c(x4, 20)
y4 = c(y4, 0)
model4 = lm(y4 ~ x4)
# regression diagnostics
mod.diag = augment(model4)
cooksd = mod.diag$.cooksd
sum(cooksd > 4/N)
## [1] 1
cooksd[cooksd > 4/N]
## [1] 16.98731
summary(model4)
##
## Call:
## lm(formula = y4 ~ x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3515 -0.8630 0.0499 0.6371 1.3540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.20107 0.88769 10.365 2.94e-09 ***
## x4 -0.41366 0.08271 -5.001 7.92e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8873 on 19 degrees of freedom
## Multiple R-squared: 0.5683, Adjusted R-squared: 0.5456
## F-statistic: 25.01 on 1 and 19 DF, p-value: 7.924e-05
plot(x4,y4,
pch=ifelse(cooksd > 4/N, 4,1), # 'X' for the outliers (pch=4), pch=1 otherwise
main="Random cloud with outlier added",
sub="dashed: model w/o outliers")
abline(model4)
# model4.censored: a new model ignoring the outliers (via Cook's d)
df = data.frame(x=x4,y=y4)
model4.censored = lm(y ~ x, data=df[cooksd < 4/N,])
abline(model4.censored, lty=2) # dashed line: model with outliers removed.
# model5: x5,y5. More outliers in the lower right.
# add more
x5 = c(x4, 18.5)
y5 = c(y4, 0.8)
x5 = c(x5, 19.5)
y5 = c(y5, 2)
model5 = lm(y5 ~ x5)
mod.diag = augment(model5)
cooksd = mod.diag$.cooksd
sum(cooksd > 4/N)
## [1] 2
cooksd[cooksd > 4/N]
## [1] 0.5417482 0.2878865
summary(model5)
##
## Call:
## lm(formula = y5 ~ x5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.34814 -0.84870 0.05036 0.72489 1.35432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.12865 0.65251 13.990 4.09e-12 ***
## x5 -0.40675 0.05583 -7.285 3.57e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8798 on 21 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.703
## F-statistic: 53.08 on 1 and 21 DF, p-value: 3.569e-07
plot(x5,y5,
pch=ifelse(cooksd > 4/N, 4,1), # 'X' for the outliers (pch=4), pch=1 otherwise
main="Outliers close together",
sub="dashed: model w/o outliers")
abline(model5)
df = data.frame(x=x5,y=y5)
model5.censored = lm(y ~ x, data=df[cooksd < 4/N,])
abline(model5.censored, lty=2)
summary(model4.censored)
##
## Call:
## lm(formula = y ~ x, data = df[cooksd < 4/N, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3960 -0.4850 -0.1938 0.6252 1.1028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.56682 1.65739 2.755 0.013 *
## x 0.05449 0.16495 0.330 0.745
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7345 on 18 degrees of freedom
## Multiple R-squared: 0.006026, Adjusted R-squared: -0.04919
## F-statistic: 0.1091 on 1 and 18 DF, p-value: 0.7449
summary(model5.censored)
##
## Call:
## lm(formula = y ~ x, data = df[cooksd < 4/N, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.34693 -0.82997 0.04098 0.67133 1.34442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.87576 0.98601 9.002 2.78e-08 ***
## x -0.38156 0.09298 -4.104 0.000605 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8756 on 19 degrees of freedom
## Multiple R-squared: 0.4699, Adjusted R-squared: 0.4419
## F-statistic: 16.84 on 1 and 19 DF, p-value: 0.0006048
ChIP-seq for chromatin marks. Value is relative signal in a mutant background versus wild type, log scale.
The measurement is made genome-wide by an experiment called ChIP-seq, which uses an antibody specific to a histone mark to capture the DNA carrying that mark, then sequences the captured regions. Those regions are identified by their sequence using next-generation sequencing, and the signal intensity reflects how heavily the histones are modified, i.e., how often the region is captured in the population of isolated sample.
Original data
# ChIP-seq for chromatin marks. Value is relative signal in a mutant background versus wildtype, log scale.
# intensity of repressive mark for a set of genes (each value = gene)
h3k9me2 = c(-2.25, -1.75, -1.83, -1.17, 0.04, -1.37, -0.97, -2.06, -0.39, -0.40, -1.34, -1.36, -1.32, -1.42, -0.98, -1.12, -1.06, -1.32, -1.32, -0.06, -0.17, -0.03, -0.22, -0.03, -0.15, -0.05, -0.23, 0.06, -0.12, -0.05, 0.20, 0.10, -0.27, -0.39, -0.33, -0.39, -0.46, -0.55, -0.53, 0.26, 0.35, 0.20, 0.11, 0.06, 0.05, -0.32, -0.30, -0.62)
# intensity of activating mark for a set of genes (each value = gene)
h3k4me3 = c(2.25, 1.97, 0.95, 0.52, -0.07, 0.56, 0.76, 1.31, 0.52, 0.87, 0.44, 0.89, 0.96, 0.20, 0.10, -0.08, 1.81, 1.55, 1.50, -0.08, -0.06, -0.08, -0.11, -0.01, -0.09, -0.07, -0.07, -0.07, -0.02, 0.06, -0.10, -0.13, -0.14, -0.15, -0.04, 0.01, 0.06, 0.01, -0.09, -0.25, -0.25, 0.02, 0.07, 0.22, 0.17, 0.12, 0.27, 0.20)
plot(h3k9me2,
h3k4me3,
xlim=c(-3,3),
ylim=c(-3,3),
xlab="Repressing Signal (H3K9me2): mutant / WT",
ylab="Activating Signal (H3K4me3): mutant / WT",
pty='s',
main="HTA-germline genes")
abline(h=0)
abline(v=0)
text(2, -2, sprintf("r = %.3f", cor(h3k9me2,h3k4me3)))
df = data.frame(h3k9me2, h3k4me3)
lm.1 = lm(h3k4me3 ~ h3k9me2, data=df)
lm.1.diag = augment(lm.1)
cooksd = lm.1.diag$.cooksd
plot(h3k9me2,
h3k4me3,
xlim=c(-3,3),
ylim=c(-3,3),
xlab="Repressing Signal (H3K9me2): mutant / WT",
ylab="Activating Signal (H3K4me3): mutant / WT",
pch=ifelse(cooksd > 4/length(cooksd), 4,1), # 4/n rule; n = 48 genes here, not the simulations' N=20
pty='s',
main="HTA-germline genes (original model plus outliers removed)",
sub="dashed: outliers removed")
abline(h=0)
abline(v=0)
# original
abline(lm.1)
# with outliers removed
lm.1.censored = lm(h3k4me3 ~ h3k9me2, data=df[cooksd < 4/length(cooksd),])
abline(lm.1.censored, lty=2)
Question: Is there a better way to characterize the correlation between histone marks for the genes affected by the mutation?
Searching around for regression diagnostics and trying to describe the problem to Google, I eventually turned up Simpson's paradox.
Simpson’s paradox (Yule-Simpson effect): a trend within groups deviates from, or even contradicts, the trend seen when the groups are aggregated.
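To see the full paradox in miniature, here is a small simulated sketch (not from the histone data) where each group trends downward but the pooled fit trends upward:

```r
set.seed(1)
# two clusters, each with slope -0.5, offset along a rising diagonal
g1x = rnorm(50);           g1y = -0.5 * g1x + rnorm(50, sd = 0.3)
g2x = rnorm(50, mean = 4); g2y = 4 - 0.5 * (g2x - 4) + rnorm(50, sd = 0.3)
x = c(g1x, g2x); y = c(g1y, g2y)
coef(lm(y ~ x))["x"]        # pooled slope: positive
coef(lm(g1y ~ g1x))["g1x"]  # within-group slopes: negative
coef(lm(g2y ~ g2x))["g2x"]
```

The between-group separation dominates the pooled fit, flipping the sign of the slope.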
## Illustration
By Pace~svwiki - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=62007681
Clusters may be provided, or estimated by mixture modeling (mclust, per the output below). The method then fits a regression within each cluster and compares each cluster's slope (beta) to the overall model.
ggplot(simp.histone$alldata, aes(X,Y,color=factor(clusterid))) +
geom_point() +
geom_abline(
slope=simp.histone$Allbeta[1],
intercept = simp.histone$Allint[1],
color="#F8766D") +
geom_abline(
slope=simp.histone$groupbeta,
intercept = simp.histone$groupint,
color="grey", size = 2, alpha = .5) +
geom_abline(
slope=simp.histone$Allbeta[2],
intercept = simp.histone$Allint[2],
color="#00BFC4") # you can find the 1st 2 ggplot colors with hue_pal()(2)
library(dplyr) # for %>%, group_by(), summarize()
correlation_by_group = simp.histone$alldata %>%
group_by(clusterid) %>%
summarize(cor(X,Y))
## `summarise()` ungrouping output (override with `.groups` argument)
print(correlation_by_group)
## # A tibble: 2 x 2
## clusterid `cor(X, Y)`
## <dbl> <dbl>
## 1 1 -0.495
## 2 2 -0.318
The intrinsic clustering did a good job of separating the groups.
The outcome: cluster 1 has a slope of -0.6798802, not as steep as the overall -0.7421426, while cluster 2 is much flatter at -0.1589689.
The correlation coefficients show that much of the strong overall correlation comes from the separation between the groups rather than from the associations within them.
coef(simp.histone)
## N Int Beta Pval
## Alldata 48 -0.07975478 -0.7421426 1.371252e-11
## Cluster 1 18 0.06391150 -0.6798802 3.664584e-02
## Cluster 2 30 -0.04268135 -0.1589689 8.719577e-02
summary(simp.histone)
## $regressionResults
## N Int Beta Pval
## Alldata 48 -0.07975478 -0.7421426 1.371252e-11
## Cluster 1 18 0.06391150 -0.6798802 3.664584e-02
## Cluster 2 30 -0.04268135 -0.1589689 8.719577e-02
##
## $mclustResults
## 'Mclust' model object: (VEV,2)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
cluster(simp.histone)
## $`Cluster 1`
## X Y clusterid
## 1 -2.25 2.25 1
## 2 -1.75 1.97 1
## 3 -1.83 0.95 1
## 4 -1.17 0.52 1
## 6 -1.37 0.56 1
## 7 -0.97 0.76 1
## 8 -2.06 1.31 1
## 9 -0.39 0.52 1
## 10 -0.40 0.87 1
## 11 -1.34 0.44 1
## 12 -1.36 0.89 1
## 13 -1.32 0.96 1
## 14 -1.42 0.20 1
## 15 -0.98 0.10 1
## 16 -1.12 -0.08 1
## 17 -1.06 1.81 1
## 18 -1.32 1.55 1
## 19 -1.32 1.50 1
##
## $`Cluster 2`
## X Y clusterid
## 5 0.04 -0.07 2
## 20 -0.06 -0.08 2
## 21 -0.17 -0.06 2
## 22 -0.03 -0.08 2
## 23 -0.22 -0.11 2
## 24 -0.03 -0.01 2
## 25 -0.15 -0.09 2
## 26 -0.05 -0.07 2
## 27 -0.23 -0.07 2
## 28 0.06 -0.07 2
## 29 -0.12 -0.02 2
## 30 -0.05 0.06 2
## 31 0.20 -0.10 2
## 32 0.10 -0.13 2
## 33 -0.27 -0.14 2
## 34 -0.39 -0.15 2
## 35 -0.33 -0.04 2
## 36 -0.39 0.01 2
## 37 -0.46 0.06 2
## 38 -0.55 0.01 2
## 39 -0.53 -0.09 2
## 40 0.26 -0.25 2
## 41 0.35 -0.25 2
## 42 0.20 0.02 2
## 43 0.11 0.07 2
## 44 0.06 0.22 2
## 45 0.05 0.17 2
## 46 -0.32 0.12 2
## 47 -0.30 0.27 2
## 48 -0.62 0.20 2
plot(simp.histone)
It’s not the full-fledged paradox because the signs aren’t reversed, but the groups in the data have different associations, or some have an association where others don’t.